Europäisches Patentamt
European Patent Office
Office européen des brevets

19.08.2004



RECD 0 7 OCT 2004 



WIPO 



PCT 



Bescheinigung Certificate 



Attestation 



Die angehefteten Unterlagen
stimmen mit der
ursprünglich eingereichten
Fassung der auf dem nächsten
Blatt bezeichneten
europäischen Patentanmeldung
überein.



The attached documents 
are exact copies of the 
European patent application 
described on the following 
page, as originally filed. 



Les documents fixés à
cette attestation sont
conformes à la version
initialement déposée de
la demande de brevet
européen spécifiée à la
page suivante.



Patentanmeldung Nr. Patent application No. Demande de brevet n° 

03388055.0 



PRIORITY 
DOCUMENT 




Der Präsident des Europäischen Patentamts;
Im Auftrag

For the President of the European Patent Office

Le Président de l'Office européen des brevets
p.o.



R C van Dijk 




Anmeldung Nr: 

Application no.: 03388055.0 
Demande no: 



Anmeldetag: 

Date of filing: 21.08.03 
Date de depot: 



Anmelder/Applicant(s)/Demandeur(s):

BERNAFON AG 
Morgenstrasse 131 
3018 Bern 
SUISSE 



Bezeichnung der Erfindung/Title of the invention/Titre de l'invention:
(Falls die Bezeichnung der Erfindung nicht angegeben ist, siehe Beschreibung.
If no title is shown please refer to the description.
Si aucun titre n'est indiqué se référer à la description.)

Method for processing audio-signals 

In Anspruch genommene Priorität(en) / Priority(ies) claimed / Priorité(s)
revendiquée(s)

Staat/Tag/Aktenzeichen/State/Date/File no./Pays/Date/Numéro de dépôt:



Internationale Patentklassifikation/International Patent Classification/
Classification internationale des brevets:

H04R25/00

Am Anmeldetag benannte Vertragsstaaten/Contracting states designated at date of
filing/Etats contractants désignés lors du dépôt:



AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL
PT RO SE SI SK TR LI



EPA/EPO/OEB Form 1014.2 - 01.2000 7001014



TITLE 

Method for processing audio-signals. 



AREA OF THE INVENTION 
The invention relates to the area of speech enhancement of audio signals, and more
specifically to a method for processing an audio signal in order to enhance the speech
components of the signal whenever they are present. Such methods are particularly
applicable to hearing aids, where they allow the hearing-impaired person to communicate
better with other people.

BACKGROUND OF THE INVENTION 

The problem of extracting a signal of interest from noisy observations is well known to
acoustics engineers. In particular, users of portable speech processing systems often
encounter interfering noise that reduces the quality and intelligibility of
speech. To reduce these harmful noise contributions, several single-channel speech
enhancement algorithms have been developed [1-4]. Nonetheless, even though single-channel
algorithms are able to improve signal quality, recent studies have reported that
they are still unable to improve speech intelligibility [5]. In contrast, multiple-microphone
noise reduction schemes have been shown repeatedly to increase speech
intelligibility and quality [6,7].

Multiple-microphone speech enhancement algorithms can be roughly classified into
quasi-stationary spatial filtering and time-variant envelope filtering [8]. Quasi-stationary
spatial filtering exploits the spatial configuration of the sound sources to reduce noise by
means of a spatial filter. The filter characteristics do not change with the dynamics of
speech but with the slower changes in the spatial configuration of the sound sources. Such
algorithms achieve almost artefact-free speech enhancement in simple, low-reverberation
environments and computer simulations. Typical examples are adaptive noise cancelling,
positive and differential beam-forming [30] and blind source separation [28,29]. The most
promising algorithms of this class proposed hitherto are based on blind source separation (BSS).
BSS is the only technique that aims to estimate an exact model of the acoustic
environment and possibly to invert it. It includes the model for de-mixing a number of
acoustic sources from an equal number of spatially diverse recordings. Additionally,
multi-path propagation through reverberation is also included in BSS models.
The basic problem of BSS consists in recovering hidden source signals using only their
linear mixtures and nothing else. Assume $d$ statistically independent sources

$$s(t) = [s_1(t), \ldots, s_d(t)]^T.$$

These sources are convolved and mixed in a linear medium, leading to $d$ sensor signals
$x(t) = [x_1(t), \ldots, x_d(t)]^T$ that may include additional noise:

$$x(t) = \sum_{\tau=0}^{P} G(\tau)\, s(t-\tau) + n(t). \qquad (1)$$

The aim of source separation is to identify the multiple channel transfer characteristics
G(t), to possibly invert them and to obtain estimates of the hidden sources given by:

$$u(t) = \sum_{\tau=0} W(\tau)\, x(t-\tau)$$

where W(t) is the estimated inverse of the multiple channel transfer characteristics G(t).
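
The convolutive mixing sum and the de-mixing sum above can be illustrated numerically. The following is a minimal sketch under assumptions that are not part of the source: the function name `convolutive_mix`, the array layout (taps, outputs, inputs), zero initial conditions, and a single-tap mixing example chosen so that an exact inverse exists. It does not estimate W(t) blindly; it only shows how the two sums are applied.

```python
import numpy as np


def convolutive_mix(filters, signals, noise=None):
    """Apply a matrix-valued FIR filter: y(t) = sum_tau filters[tau] @ signals(t - tau).

    filters: array of shape (taps, n_out, n_in)   (hypothetical layout)
    signals: array of shape (n_in, T), assumed zero for t < 0
    noise:   optional array of shape (n_out, T) added to the result
    """
    taps, n_out, _ = filters.shape
    T = signals.shape[1]
    out = np.zeros((n_out, T))
    for tau in range(taps):
        # Delay the inputs by tau samples, then apply the tap matrix.
        out[:, tau:] += filters[tau] @ signals[:, : T - tau]
    if noise is not None:
        out = out + noise
    return out


rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))        # two independent sources

# Single-tap (instantaneous) mixing, chosen so that an exact inverse exists.
G = np.array([[[1.0, 0.5],
               [0.3, 1.0]]])
x = convolutive_mix(G, s)                 # sensor signals, no additional noise

W = np.linalg.inv(G[0])[np.newaxis]       # ideal de-mixing filter for this case
u = convolutive_mix(W, x)                 # estimates of the hidden sources
```

With more than one tap, an exact finite-length inverse does not exist in general, which is one way to see why the number of parameters to identify grows with the length of the room response.
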
Numerous algorithms have been proposed for the estimation of the inverse model W(t).
They mainly exploit the assumed statistical independence of the hidden source signals.
The statistical independence can be exploited in different ways, and additional
constraints can be introduced, such as intrinsic correlations or non-stationarity of the
source signals and/or noise. As a result, a large number of BSS algorithms in various
implementation forms (e.g. time domain, frequency domain and time-frequency domain)
have been proposed recently for multiple-channel speech enhancement (see for example
[28,29]).

Dogan and Sterns [9] use cumulant-based source separation to enhance the signal of
interest in binaural hearing aids. Rosca et al. [10] apply blind source separation for
de-mixing delayed and convolved sources from the signals of a microphone array. A
post-processing step is proposed to improve the enhancement. Jourjine et al. [11] use the
statistical distribution of the signals (estimated using histograms) to separate speech and noise.
Balan et al. [2] propose autoregressive (AR) modelling to separate sources from a
degenerate mixture. Several approaches use the spatial information provided by a plurality
of microphones using beamformers. Koroljow and Gibian [12] use first and second order
beamformers to adapt the directivity of the hearing aids to the noise conditions.
Bhadkamkar and Ngo [3] combine a negative beamformer to extract the speech source
with post-processing to remove reverberation and echoes. Lindemann [13] uses a
beamformer to extract the energy from the speech source and an omni-directional
microphone to obtain the total energy from the speech and noise sources. The ratio
between these two energies allows the speech signal to be enhanced by spectral weighting.
Feng et al. [14] reconstruct the enhanced signal using delayed versions of the signals of
a binaural hearing aid system.

BSS techniques have been shown to achieve almost artefact-free speech enhancement in
simple, low-reverberation environments, laboratory studies and computer simulations, but
they perform poorly for recordings in reverberant environments and/or with diffuse noise.
One could speculate that in reverberant environments the number of model parameters
becomes too large to be identified accurately in noisy, non-stationary conditions.

In contrast, envelope filtering approaches (e.g. Wiener, DCT-Bark, coherence and directional
filtering) do not suffer from such failures since they use a simple statistical description of
the acoustical environment or of the binaural interaction in the human auditory system [8].
Such algorithms process the signal in an appropriate dual domain. The envelope of the 
target signal or equivalently a short time weighting index (short-time signal-to-noise 
ratio (SNR), coherence) is estimated in several frequency bands. The target is assumed to 
be of frontal incidence and the enhanced signal is obtained by modulating the spectral 
envelope of the noisy signal by the estimated short time weighting index. The adaptation 
of the weighting index has a temporal resolution of about the syllable rate. Dual channel 
approaches based on the statistical description of the sources using the coherence 
function have been presented [1,15-17]. Further improvements have been obtained by 
merging spatial coherence of noisy sound fields, masking properties of the human 
auditory system and subspace approaches [19]. 

Multi-channel speech enhancement algorithms based on envelope filtering are
particularly appropriate for complex acoustic environments, namely those with diffuse noise
and strong reverberation. Nevertheless, they are unable to provide loss-less or artefact-free
enhancement. Globally, they reduce noise contributions in the time-frequency regions
without any speech contributions. In contrast, in time-frequency regions with speech
contributions, the noise cannot be reduced and distortions can be introduced. This is
the main reason why envelope filtering might help to reduce the listening effort in such
environments while an intelligibility improvement is generally lacking [20].



The above considerations point out that the performance of multiple-channel speech
enhancement algorithms depends essentially on the complexity of the acoustical context.
A given algorithm is appropriate for a specific acoustic environment. In order to
cope with changing properties of the acoustic environment, composite algorithms have
been proposed more recently.

The approach proposed by Melanson and Lindemann in [21] consists of manual
switching between different algorithms to enhance speech under various conditions.
Manual switching between several combinations of filtering and dynamic compression
has also been proposed by Lindemann et al. [22].

More advanced techniques using automatic switching according to different noise
conditions have been proposed by Killion et al. in [23]. The input of the hearing aid is
switched automatically between an omnidirectional and a directional microphone.

A strategy selective algorithm has been described by Wittkop [24]. This algorithm uses 
an envelope filtering based on a generalized Wiener approach and an envelope filtering 
invoking directional inter-aural level and phase differences. A coherence measure is used 
to identify the acoustical situations and gradually switch off the directional filtering with 
increasing complexity. It is pointed out that this algorithm helps to reduce the listening
effort in noisy environments but that an intelligibility improvement is still lacking.

Therefore, it is the aim of the present invention to provide a composite method including
source separation and coherence based envelope filtering. Source separation and
coherence based envelope filtering are achieved in the time Bark domain, i.e. in specific
frequency bands. Source separation is performed in bands where coherent sound fields of
the signal of interest or of a predominant noise source are detected. Coherence based
envelope filtering acts in bands where the sound fields are diffuse and/or where the
complexity of the acoustic environment is too large. Source separation and coherence
based envelope filtering may act in parallel and are activated in a smooth way through a
coherence measure in the Bark bands.

It is a further aim of the present invention to provide a real binaural enhancement of
the observed sound field by using the multiple channel transfer characteristics identified
by source separation. Indeed, common speech enhancement algorithms mainly achieve
a monaural speech enhancement, which implies that users of such devices lose
the ability to localize sources. A promising solution, which could achieve real binaural
speech enhancement, consists of a device with one or two microphones in each ear and
an RF link in between. The benefit for the user would be enormous. Notably, it has been
reported that binaural hearing increases the loudness and signal-to-noise ratio of the
perceived sound, improves the intelligibility and quality of speech, and allows the
localization of sources, which is of prime importance in situations of danger.
Lindemann and Melanson [25] propose a system with wireless transmission between the
hearing aids and a processing unit worn on the belt of the user. Brander [7] similarly
proposes direct communication between the two ear devices. Goldberg et al. [26]
combine the transmission and the enhancement. Finally, optical transmission via glasses
has been proposed by Martin [27]. Nevertheless, none of these approaches proposes a
virtual reconstruction of the binaural sound field. The approach proposed
herein, namely the exploitation of the multiple channel transfer characteristics identified by
source separation to reconstruct the real sound field and attenuate noise contributions,
considerably improves the safety and the comfort of the listener.

SUMMARY OF THE INVENTION



The invention comprises a method for processing audio-signals whereby audio signals
are captured at two spaced apart locations and subject to a transformation in the
perceptual domain (Bark or Mel decomposition), whereupon the enhancement of the
speech signal is based on the combination of parametric (model based) and
non-parametric (statistical) speech enhancement approaches:

a. a source separation process is performed to give a first estimate of the
wanted signal parts and the noise parts of the microphone signals and

b. a coherence based envelope filtering is performed to give a second
estimate of the wanted signal parts of the microphone signals,

and where further a sound field diffuseness detection is performed on the at least two
signals, whereby further the sound field diffuseness detection is used to mix the outputs
of the source separation process and of the coherence based envelope filtering in order to
achieve the best possible signal. The transfer functions estimated by the source separation
algorithms are used to reconstruct a virtual stereophonic sound field (spatial localisation
of the different sound sources).

When the speech and noise sources are in the direct sound field (the direct path between
sound sources and microphones is dominant, reverberation is low), the transmission
transfer function of each source in each ear system can be estimated and used
to separate speech and noise signals by the use of source separation. These transfer
functions are estimated using source separation algorithms. The learning of the
coefficients of the transfer functions can be either supervised (when only the noise
source is active) or blind (when speech and noise sources are active simultaneously). The
learning rate in each frequency band can be dependent on the signal characteristics. The
signal obtained with this approach is the first estimate of the clean speech signal.

When the noise signal is in the reverberant sound field (the contribution from
reverberation is comparable to that of the direct path), source separation approaches
fail due to the complexity of the transfer functions to be evaluated. A statistically based
envelope filtering can be used to extract speech from noise. The short-time coherence
function calculated in the transform domain (Bark or Mel) allows a probability of
presence of speech to be estimated in each Bark or Mel frequency band. Applying it to the
noisy speech signal makes it possible to extract the bands where speech is dominant and
to attenuate those where noise is dominant. The signal obtained with this approach is the
second estimate of the clean speech signal.

These two estimates of the clean speech signal are then mixed to optimise the
performance of the enhancement. The mixing is performed independently in each
frequency band, depending on the sound field characteristic of each frequency band. The
respective weight for each approach and for each frequency band is calculated from the
coherence function.

During the combination of the signals calculated from the two approaches, the transfer
functions estimated by source separation are used to reconstruct a virtual stereophonic
sound field and to recover the spatial information from the different sources.

In a further embodiment of the invention the sound field diffuseness detection is based
on the value of a short-time coherence function where the coherence function is
expressed as:

$$\Gamma_{x_1 x_2}(\omega) = \text{[right-hand side not legible in the source]}$$



This function varies between zero and one, according to the amount of "coherent" signal.
When the speech signal dominates the frequency band, the coherence is close to one, and
when there is no speech in the frequency band, the coherence is close to zero. Once the
diffuseness of the sound field is known, the results of the source separation and of the
coherence based approach can be combined optimally to enhance the speech signals. The
combination can consist of using only one of the approaches when the noise source is
totally in the direct sound field or totally in the diffuse sound field, or of combining the
results when some of the frequency bands are in the direct sound field and others are in the
diffuse sound field.
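
A short-time coherence measure with the properties described above (bounded between zero and one, close to one when both channels carry the same source) can be sketched as follows. The exact expression used by the invention is not recoverable here, so this sketch assumes the widely used magnitude-squared coherence, with auto- and cross-spectra smoothed recursively over frames. The function name `short_time_coherence`, the smoothing constant `alpha` and the regularisation `eps` are illustrative assumptions.

```python
import numpy as np


def short_time_coherence(X1, X2, alpha=0.9, eps=1e-12):
    """Magnitude-squared coherence per frame and band (assumed definition).

    X1, X2: complex arrays of shape (frames, bands), e.g. band-wise spectra
            of the two microphone signals.
    alpha:  recursive smoothing constant (hypothetical value).
    """
    n_frames, n_bands = X1.shape
    p11 = np.zeros(n_bands)                    # smoothed auto-spectrum, channel 1
    p22 = np.zeros(n_bands)                    # smoothed auto-spectrum, channel 2
    p12 = np.zeros(n_bands, dtype=complex)     # smoothed cross-spectrum
    coherence = np.zeros((n_frames, n_bands))
    for m in range(n_frames):
        p11 = alpha * p11 + (1.0 - alpha) * np.abs(X1[m]) ** 2
        p22 = alpha * p22 + (1.0 - alpha) * np.abs(X2[m]) ** 2
        p12 = alpha * p12 + (1.0 - alpha) * X1[m] * np.conj(X2[m])
        coherence[m] = np.abs(p12) ** 2 / (p11 * p22 + eps)
    # Guard against rounding slightly outside [0, 1].
    return np.clip(coherence, 0.0, 1.0)
```

Note that the very first frames always yield a value near one because too few frames have been averaged; in practice the measure is only meaningful once the recursive averages have settled.
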



BRIEF DESCRIPTION OF THE DRAWINGS 



Fig. 1 is a block diagram of the proposed approach. 

Fig. 2 is a complete mixing model for speech and noise sources. 

Fig. 3 is a modified mixing model. 

Fig. 4 is a de-mixing model.

DESCRIPTION OF A PREFERRED EMBODIMENT

The aim of a hearing aid system is to improve the intelligibility of speech for
hearing-impaired persons. Therefore it is important to take into account the specificity of
the speech signal. Psycho-acoustical studies have shown that the human perception of
frequency is not linear with frequency, but that the sensitivity to frequency changes
decreases as the frequency of the sound increases. This property of the human hearing
system has been widely used in speech enhancement and speech recognition systems to
improve their performance. The use of critical band modelling (Bark or Mel
frequency scale) improves the statistical estimation of the speech and noise
characteristics and, thus, the quality of the speech enhancement.
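
The grouping of frequency bins into critical bands can be sketched as follows. The text does not state which Bark approximation is used, so the sketch assumes the Zwicker and Terhardt formula; the function names, the FFT size and the sampling rate are illustrative assumptions only.

```python
import numpy as np


def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the Bark scale (assumed choice)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)


def bark_band_index(n_fft, fs):
    """Integer Bark band index for every bin of a real FFT of length n_fft."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return np.floor(hz_to_bark(freqs)).astype(int)


def band_energies(spectrum, band_index):
    """Sum the power |X|^2 of all bins falling into each Bark band."""
    power = np.abs(spectrum) ** 2
    return np.bincount(band_index, weights=power)


# Example: one frame of 512 samples at a hypothetical 16 kHz sampling rate.
rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
X = np.fft.rfft(frame)
idx = bark_band_index(512, 16000)
energies = band_energies(X, idx)
```

Low bands contain only a few bins while high bands contain many, which reflects the decreasing frequency resolution of hearing and gives the statistical estimates in the upper bands more data to average over.
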

When the speech and noise sources are in the direct sound field (low-reverberation
acoustical environment), the transmission transfer function of each source in each ear
system can be estimated and used to separate the speech and noise signals. The mixing
system is presented in figure 2.

The mixing model of figure 2 can be modified to be equivalent to the model of figure 3.

The inversion of the transfer functions H12 and H21 allows the original
signals to be recovered up to the modification induced by the transfer functions G11 and
G22. The de-mixing model is presented in figure 4.

The de-mixing transfer functions W12 and W21 can be estimated using higher order
statistics or time delayed estimation of the cross-correlation between the two signals. The
estimation of the model parameters can be either supervised (when only one source is
active) or blind (when the speech and noise sources are active simultaneously). The
learning rate of the model parameters can be adjusted according to the nature of the
sound field condition in each frequency band. The resulting signals are the estimates of
the clean speech and noise signals.
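
The supervised case can be illustrated with a small time-domain sketch. Since the figures are not reproduced here, the structure is an assumption: each channel is modelled as its own source plus the other source filtered by a cross path (h12, h21), and de-mixing is done with a feedforward structure that subtracts the filtered opposite channel. The cross filters are fitted by ordinary least squares while only one source is active, which stands in for the higher order statistics or time delayed cross-correlation methods named in the text. All names and numerical values are hypothetical.

```python
import numpy as np


def fir(h, x):
    """Causal FIR filtering of x by h, truncated to len(x)."""
    return np.convolve(h, x)[: len(x)]


def estimate_cross_filter(ref, target, n_taps):
    """Least-squares FIR filter w such that fir(w, ref) approximates target."""
    T = len(ref)
    # Column k of A holds ref delayed by k samples.
    A = np.column_stack(
        [np.concatenate([np.zeros(k), ref[: T - k]]) for k in range(n_taps)]
    )
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return w


def demix(x1, x2, w12, w21):
    """Feedforward de-mixing: subtract the cross-talk predicted by w12 and w21."""
    u1 = x1 - fir(w12, x2)
    u2 = x2 - fir(w21, x1)
    return u1, u2


rng = np.random.default_rng(0)
h12 = np.array([0.5, 0.2])     # hypothetical cross path: source 2 -> channel 1
h21 = np.array([0.3, -0.1])    # hypothetical cross path: source 1 -> channel 2

# Supervised phase: only one source is active at a time.
n = rng.standard_normal(2000)                  # source 2 alone
w12 = estimate_cross_filter(ref=n, target=fir(h12, n), n_taps=2)
t = rng.standard_normal(2000)                  # source 1 alone
w21 = estimate_cross_filter(ref=t, target=fir(h21, t), n_taps=2)

# Both sources active simultaneously.
s1 = rng.standard_normal(4000)
s2 = rng.standard_normal(4000)
x1 = s1 + fir(h12, s2)
x2 = s2 + fir(h21, s1)
u1, u2 = demix(x1, x2, w12, w21)
```

In this sketch each output contains only one source, but filtered by the residual term (1 - h12 h21), which is one example of sources being recovered only up to a filtering.
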



When the noise source is not in the direct sound field (reverberant environment), the
mixing transfer functions become complicated and it is not possible to estimate them in
real time on a typical processor of a hearing aid system. However, under the assumption
that the speech source is in the direct sound field, the two channels of the binaural system
always carry information about the spatial position of the speech source, and this
information can be used to enhance the signal. A statistically based weighting approach
can be used to extract the speech from the noise. The short-time coherence function allows
a probability of presence of speech to be estimated. Such a measure defines a weighting
function in the time-frequency domain. Applying it to the noisy speech signals makes it
possible to determine the regions where speech is dominant and to attenuate the regions
where noise is dominant.

As presented previously, two enhancement approaches are used in the proposed
method. The aim of the sound field diffuseness detection is to detect the acoustical
conditions in which the hearing aid system is working. The detection block gives an
indication of the diffuseness of the noise source. The result may be that the noise
source is in the direct sound field, in the diffuse sound field or in between. The
information is given for each Bark or Mel frequency band. The coherence function
presented previously provides a measure of diffuseness. When the coherence is equal (or
nearly equal) to one during speech pauses, the noise source is in the direct sound field.
When it is close to zero, the noise source is in the diffuse sound field. For intermediate
values, the acoustical environment is between direct and diffuse sound field.

Once the diffuseness of the sound field is known, the results of the parametric approach
(source separation) and of the non-parametric approach (coherence) can be combined
optimally to enhance the speech signals. The combination may be achieved gradually by
weighting the signal provided by source separation by the diffuseness measure and
the signal provided by the coherence approach by the complement to one of the
diffuseness measure.
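
The gradual combination described above amounts to a band-wise cross-fade. A minimal sketch follows, assuming that the diffuseness measure is the coherence value in each band (one for a direct sound field, zero for a diffuse one) and that both estimates are available as arrays of the same shape. The function and variable names are illustrative.

```python
import numpy as np


def mix_estimates(y_separation, y_coherence, diffuseness_measure):
    """Cross-fade the two speech estimates band by band.

    diffuseness_measure: value in [0, 1] per band; 1 selects the source
    separation output, 0 selects the coherence based output.
    """
    w = np.clip(np.asarray(diffuseness_measure, dtype=float), 0.0, 1.0)
    return w * y_separation + (1.0 - w) * y_coherence


# Three bands: direct field, diffuse field, and an intermediate case.
y_bss = np.array([1.0, 1.0, 1.0])
y_coh = np.array([0.0, 0.0, 0.0])
measure = np.array([1.0, 0.0, 0.25])
mixed = mix_estimates(y_bss, y_coh, measure)
```

Because the weights of the two estimates always sum to one, the transition between the two approaches is smooth when the measure changes slowly over time.
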

As the de-mixing transfer functions have been identified during the source separation, 
they can be used to reconstruct the spatiality of the sound sources. The noise source can 
be added to the enhanced speech signal, keeping its directivity but with reduced level. 
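
Re-inserting the noise at a reduced level while keeping its direction can be sketched as follows. The sketch assumes a two-channel model in which each source reaches the opposite channel through a cross filter (h12, h21), that these filters are known from the source separation stage, and that ideal speech and noise estimates are available. The function names and the gain value are hypothetical.

```python
import numpy as np


def fir(h, x):
    """Causal FIR filtering of x by h, truncated to len(x)."""
    return np.convolve(h, x)[: len(x)]


def reconstruct_binaural(speech, noise, h12, h21, noise_gain=0.25):
    """Rebuild both ear signals with the noise attenuated by noise_gain.

    speech: speech estimate as observed at channel 1
    noise:  noise estimate as observed at channel 2
    h12:    assumed cross path from the noise source to channel 1
    h21:    assumed cross path from the speech source to channel 2
    """
    out1 = speech + noise_gain * fir(h12, noise)
    out2 = fir(h21, speech) + noise_gain * noise
    return out1, out2


rng = np.random.default_rng(0)
s1 = rng.standard_normal(1000)
s2 = rng.standard_normal(1000)
h12 = np.array([0.5, 0.2])
h21 = np.array([0.3, -0.1])
left, right = reconstruct_binaural(s1, s2, h12, h21, noise_gain=0.25)
```

Since the same cross filters are applied as in the assumed mixing model, the inter-channel relation of each source is preserved; only the level of the noise changes.
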
Such an approach offers the advantage that the intelligibility of the speech signal is
increased (by the reduction of the noise level), while the information about noise sources
is kept (this can be useful when the noise source is a danger). By keeping the spatial
information, the comfort of use is also increased.

CLAIMS

1. Method for processing audio-signals whereby audio signals are captured at two
spaced apart locations and subject to a transformation in the perceptual domain,
whereupon:

a. a source separation process is performed to give a first estimate of the
wanted signal parts and the noise parts of the microphone signals and

b. a coherence based envelope filtering is performed to give a second
estimate of the wanted signal parts of the microphone signals, and where
further a sound field diffuseness detection is performed on the at least two
signals,

whereby further the sound field diffuseness detection is used to mix the output
from the blind source separation and the coherence based separation process in
order to achieve the best possible signal.

2. Method as claimed in claim 1 whereby a virtual stereophonic reconstruction of
the signal is performed prior to presenting the resulting audio signal to the right and
left ear of a person, whereby the stereophonic recombination is performed on the
basis of spatial information on the sound field.

3. Method as claimed in claim 1, where the sound field diffuseness detection is
based on the value of a short-time coherence function where the coherence
function is expressed as:

$$\Gamma_{x_1 x_2}(k) = \text{[right-hand side not legible in the source]}$$

where k is the number of the frequency band in the Bark or Mel frequency space.

ABSTRACT

The invention relates to a method for processing audio-signals whereby audio signals
are captured at two spaced apart locations and subject to a transformation in the
perceptual domain (Bark or Mel), whereupon:

a. a (blind or supervised) source separation process is performed to give a 
first estimate of the wanted signal parts and the noise parts of the 
microphone signals and 

b. a coherence based separation process is performed to give a second 
estimate of the wanted signal parts and the noise parts of the microphone 
signals, and where further a sound field diffuseness detection is performed 
on the at least two signals, 

whereby further the sound field diffuseness detection is used to mix the output
from the blind source separation and the coherence based separation process in
order to achieve the best possible signal. The transfer functions calculated from
the source separation are used to reconstruct a virtual stereophonic sound field and to
restore the spatial information about the source position in the enhanced signals.